ABSTRACT

The mission of the Universal Protein Resource 
(UniProt) (http://www.uniprot.org) is to support 
biological research by providing a freely accessible, 
stable, comprehensive, fully classified, richly 
and accurately annotated protein sequence 
knowledgebase. It integrates, interprets and standardizes
data from numerous resources to achieve
the most comprehensive catalogue of protein sequences
and functional annotation. UniProt comprises
four major components, each optimized for
different uses: the UniProt Archive, the UniProt
Knowledgebase, the UniProt Reference Clusters 
and the UniProt Metagenomic and Environmental 
Sequence Database. UniProt is produced by the 
UniProt Consortium, which consists of groups 
from the European Bioinformatics Institute (EBI), 
the SIB Swiss Institute of Bioinformatics (SIB) and 
the Protein Information Resource (PIR). UniProt is 
updated and distributed every 4 weeks and can be 
accessed online for searches or downloads. 

INTRODUCTION 

UniProt's goal is to provide the most comprehensive
resource for protein sequence and functional annotation. 
The four UniProt databases are optimized for different 
uses as follows: the UniProt Knowledgebase 
(UniProtKB) is an expertly curated database; the 
UniProt Archive (UniParc) (1) is a comprehensive 
sequence repository, reflecting the history of all protein 
sequences not only in the UniProtKB but also in all 
source databases; the UniProt Reference Clusters
(UniRef) merge closely related sequences based
on sequence identity to facilitate sequence similarity
searches (2); and the UniProt Metagenomic and
Environmental Sequence (UniMES) database was
created to cater for the developing area of metagenomics.
The aim of this article is to provide a status report on 
UniProt activities and some of our plans for the near 
future that will enable us to successfully continue to play 
a critical role in bioinformatics discovery in the genomic 
and proteomic era. 

NEW AND ONGOING DEVELOPMENTS 

UniProtKB reorganization 

As the cost of sequencing continues to fall, the number of 
organisms with complete proteomes in UniProtKB is 
increasing. It is also becoming more and more common 
in the scientific community for many groups to sequence 
the complete proteomes of the same organism or multiple 
strains of an organism. This means that users are presented
with an increasingly large data set, which can be
difficult to navigate and is largely redundant in biological
knowledge. In response, the UniProt Consortium is
developing a concept to provide a set of sequences from 
selected species based on UniProtKB manually reviewed 
entries, the Reference Proteomes and the Representative 
proteomes (3,4). We re-evaluated our manual annotation 
priorities, and re-defined our organism focus list. For 
more information, please see http://www.uniprot.org/program.
Curators continue to define complete proteomes
and reference proteomes as they become available. To 
ensure comprehensiveness, several changes were required 
in the UniProt import pipeline. Historically, the great 
majority of UniProt sequences are based on translations 
of genome sequence submissions to the International 
Nucleotide Sequence Database Consortium (INSDC) 
(5). Our longstanding collaboration has been deepened 
to include the joint definition of complete genomes and 
the grouping together of all the genome submissions 
(e.g. individual chromosomes, organelles) for an organism
that originate from the same sequencing project under one 
unique set accession. In addition, we have extended the 
import pipeline to include Ensembl (6) and Ensembl 
Genomes (7) sequences. This was to ensure comprehen- 
siveness, as the full and/or up-to-date annotation of 
genomes is sometimes not submitted to the INSDC, for 
example, Apis mellifera (http://metazoa.ensembl.org/Apis_mellifera/Info/Index).
The Ensembl sequences are
mapped to their UniProtKB counterparts under stringent 
conditions, requiring 100% identity for 100% of the 
length of the two sequences. Ensembl sequences that are 
absent from UniProtKB are imported into UniProtKB/ 
TrEMBL. The UniProtKB entries provide a cross- 
reference back to the appropriate Ensembl record(s) 
where available, enabling an easy transition to the 
genomic view. The one exception to this approach is for 
the Homo sapiens complete proteome, where there are 
some cross-references to Ensembl in the UniProtKB/ 
Swiss-Prot entries that do not follow the aforementioned 
criteria. This is because the two resources draw on different
evidence and sources for the sequence.
The cross-reference mapping is, however,
enhanced with the usage of HUGO Gene Nomenclature 
Committee (HGNC) (http://www.genenames.org) identifiers.
Of the 20 224 UniProtKB/Swiss-Prot entries, 18 696
entries have at least one sequence that has 100% identity 
for 100% of the length of an Ensembl transcript. The 
UniProt curators and the Ensembl curators and gene 
builders are progressively working through the rest of 
the differences, correcting them where appropriate and 
documenting agree-to-disagree decisions. This is part 
of the Consensus CDS (CCDS) project, which is a collab- 
orative effort to identify a core set of human and mouse 
protein coding regions that are consistently annotated and 
of high quality (http://www.ncbi.nlm.nih.gov/CCDS/). 
The long-term goal is to support convergence towards a 
standard set of gene annotations. UniProt has also 
extended the pipeline to import RefSeq (8) sequences, 
and we are currently evaluating how to combine this 
data with the existing UniProtKB and Ensembl data. All 
of these developments have had the side benefit of estab- 
lishing a close and mutually beneficial collaboration with 
the Ensembl and RefSeq groups. We import their se- 
quences while they import our annotations into their 
records (in particular the protein nomenclature and 
sequence feature annotations), and their prediction 
pipelines learn from our manually reviewed and experimentally
proved sequences. There is a consensus that
we should all provide the same (in sequence and annotation)
complete proteomes and collaborate on the definition
of Reference proteomes. Another outcome of this
collaboration is the ongoing development of genome an- 
notation standards (including protein nomenclature), and 
the promotion of these standards by the sequencing 
community (9). 
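
The stringent Ensembl-to-UniProtKB mapping criterion described above (100% identity over 100% of the length of both sequences) amounts, for ungapped protein strings, to exact string equality. The following is a minimal sketch of that idea only, not the production import pipeline; the function name, identifiers and sequences are invented for illustration.

```python
from collections import defaultdict


def map_exact_matches(ensembl, uniprotkb):
    """Map each Ensembl ID to the UniProtKB accessions with an identical sequence.

    Identical here means 100% identity over 100% of the length of both
    sequences, which for ungapped strings is plain string equality.
    """
    # Index UniProtKB sequences once so each Ensembl lookup is a dictionary hit.
    by_sequence = defaultdict(list)
    for accession, sequence in uniprotkb.items():
        by_sequence[sequence.upper()].append(accession)

    mapped = {}     # Ensembl ID -> sorted list of matching accessions
    unmatched = []  # Ensembl IDs with no identical UniProtKB sequence
    for ensembl_id, sequence in ensembl.items():
        hits = by_sequence.get(sequence.upper())
        if hits:
            mapped[ensembl_id] = sorted(hits)
        else:
            unmatched.append(ensembl_id)
    return mapped, unmatched


# Toy data: all identifiers and sequences are hypothetical.
uniprotkb = {"P00001": "MKTAYIAK", "P00002": "MKTAYIAKQR"}
ensembl = {"ENSP_A": "MKTAYIAK", "ENSP_B": "MKTAYIAKQ"}

mapped, unmatched = map_exact_matches(ensembl, uniprotkb)
print(mapped)
print(unmatched)
```

In this sketch, `mapped` corresponds to the entries that would receive an Ensembl cross-reference, and `unmatched` to the sequences that would be candidates for import into UniProtKB/TrEMBL.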

UniProt biocuration 

UniProt's central focus is the annotation — both manual 
and automatic — of the UniProt Knowledgebase. 



Manual curation challenge 

Historically, the sequences from the same gene (or from more
than one gene when the resulting protein sequences were 100%
identical) from the same organism were merged into one
UniProtKB/Swiss-Prot entry. Discrepancies between
sequence reports were identified, and the underlying 
causes, such as alternative splicing, natural variations, 
frameshifts and so forth, were annotated. Journal 
articles provided the main source of experimental know- 
ledge, with the full text of each article being read and the 
information extracted. The aim of this approach was to 
provide a central hub of information for each protein, but 
it also meant that many UniProtKB/Swiss-Prot entries 
contain sequences and annotations from many strains. 
In the era of complete genomes and proteomes at the 
strain level for so many organisms, UniProt has 
modified this policy. We are now providing entries that 
contain the protein products from a particular gene from a 
particular species + strain with the experimental literature 
being annotated to that species + strain and propagated as 
appropriate to other species and strains, ideally through 
the UniRule pipeline (see later in the text). This has the 
advantage of providing a gold standard experimental set 
in UniProtKB/Swiss-Prot and automatically propagating 
appropriate annotation to the ever increasing number of 
complete proteomes for which there is no experimental 
data in UniProtKB/TrEMBL. 

Automatic annotation approaches 

UniProt has developed two complementary systems to 
automatically annotate the protein sequences in 
UniProtKB/TrEMBL. The first system, UniRule, which 
incorporates the HAMAP (10), RuleBase (11) and PIR 
Rule (12,13) systems, consists of annotation rules 
created and monitored by experienced curators. Each an- 
notation rule specifies a number of annotations and con- 
ditions which must be satisfied for that annotation to be 
applied. These conditions may include family membership 
[as indicated by a match to a family defined by InterPro 
(14)], taxonomic constraints and the presence of particular 
sequence features. Rules are created by curators based on 
information from experimentally characterized template 
entries, and their predictions evaluated against the 
content of manually annotated UniProtKB/Swiss-Prot 
entries, which serve as the gold standard. With each 
UniProt release, the monitoring system sends those rules 
that are inconsistent with UniProtKB/Swiss-Prot annota- 
tion to curators for review. This ensures that only 
high-quality predictions are added and prevents propaga- 
tion of potentially erroneous data. The second system, the 
Statistical Automatic Annotation System [SAAS, previously
named Spearmint (15)], supplements the labour-intensive
UniRule system and generates automatic rules
for functional annotation from UniProtKB/Swiss-Prot
entries using the C4.5 decision-tree algorithm. This algorithm
uses entropy gain to find the most concise rule for
an annotation based on the criteria of sequence length,
InterPro-group membership and taxonomy. Generating
rules 'on the fly' ensures their evolution along with the
UniProtKB with little or no manual intervention while 
providing seed rules for exploitation in the UniRule 
system. This combined approach produces annotation for
34% of UniProtKB/TrEMBL entries at the current time. 
All predictions are refreshed with each UniProtKB release 
to ensure the latest state-of-knowledge predictions. 
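
To make the entropy-gain criterion mentioned above more concrete, the sketch below computes plain information gain for two candidate attributes on a toy training set. It is an illustration under stated assumptions rather than the SAAS implementation: the attribute names, the `IPR_A`/`IPR_B` group labels and the data are hypothetical, and C4.5 proper normalises this quantity into a gain ratio before choosing a split.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` obtained by splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    # Group the target labels by the value of the candidate attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Toy training set: does an entry carry a given annotation? (hypothetical data)
rows = [
    {"interpro": "IPR_A", "taxon": "Bacteria", "annotated": True},
    {"interpro": "IPR_A", "taxon": "Eukaryota", "annotated": True},
    {"interpro": "IPR_B", "taxon": "Bacteria", "annotated": False},
    {"interpro": "IPR_B", "taxon": "Eukaryota", "annotated": False},
]

gains = {a: information_gain(rows, a, "annotated") for a in ("interpro", "taxon")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

Here the InterPro group separates annotated from unannotated entries perfectly, so it yields the larger gain and would be chosen as the rule condition; taxonomy carries no information about the annotation in this toy set.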

Gene Ontology annotation 

UniProt continues to be a major provider of Gene 
Ontology (GO) annotations to the GO Consortium (16). 
UniProt curators are actively involved in curating 
UniProtKB entries with GO terms, providing both
high-quality manual GO annotations and
contributions to electronic GO annotation pipelines.
Manual GO annotations are made during the UniProt 
literature curation process, and, at the time of writing, 
almost 214 000 annotations have been manually assigned
to >37 000 proteins by UniProt curators. The curators 
also supply information to entries that is subsequently 
used in electronic GO annotation pipelines, such as 
UniProt keywords2GO, UniProt subcellular 
location2GO and InterPro2GO. A new automatic 
pipeline, UniPathway2GO [a collaboration between 
UniProt, INRIA (Rhone-Alpes) and Laboratoire 
d'Ecologie Alpine (Grenoble) (17)], was initiated in May
2012 and provides GO annotations describing the metabolic
pathways that proteins are involved in. Altogether,
the UniProt-supplied automatic annotation pipelines
provide 42.5 million annotations to >14 million 
proteins. UniProt also incorporates annotations from 
other GO Consortium members and affiliates and 
displays these annotations in the relevant UniProt 
entries. Currently, the UniProt-GO annotation project 
provides GO annotations for 65% of UniProt entries. 

Highlighting the UniProt website 

As a result of recent usability testing with the UniProt user 
community, we would like to highlight the following 
features on the UniProt website (http://www.uniprot.org),
which is the main access point for the data available
in the UniProt databases and the tools to explore it. The 
tabbed bar on the top of each page includes multiple tools, 
such as free text 'Search', 'BLAST' sequence similarity 
search, 'Align' for multiple sequence alignment, 
'Retrieve' for batch downloads and 'ID mapping'. ID 
mapping is a tool to convert UniProt identifiers to corres- 
ponding identifiers from a number of other databases 
available in a dropdown list or vice versa. There is also 
functionality available to help users personalize their 
experience with the website. For example, the search 
results page contains the 'Customize' button above the 
results table to help modify the table. This allows 
removal or addition of data to the results table from a 
vast selection of available columns, such as Gene 
Ontology, Cross-references, Sequence features and so 
forth, to help users find their proteins of interest. Users 
can then click on checkboxes at the left of the results table 
to add their proteins of interest to a selection cart that 
appears at the bottom of the page. The cart provides 
tools to help analyse or download the selected entries 
and saves selections across searches. The protein entry 
page contains the 'Customize order' button on the grey
navigation tool bar that allows users to reorder sections
within the entry. 



DATABASE ACCESS AND FEEDBACK 

The http://www.uniprot.org website (18) is the primary 
access point to our data and documentation and offers 
tools, such as full text and field-based text search, 
sequence similarity search, multiple sequence alignment, 
batch retrieval and database identifier mapping. The 
home page features a site tour as a quick introduction 
for novice users. The full text search allows quick and 
easy searching without previous knowledge of our data 
or search syntax. The results are sorted by relevance, 
and search suggestions are provided, where possible, to 
help filter searches that yield too many or no results. 
More complex queries can be built with the field-based 
text search, either iteratively with a query builder or by 
entering them manually in the query field, which can be 
faster and more powerful (http://www.uniprot.org/help/text-search).
Searching with ontology terms is assisted by
auto-completion, and search results can be browsed by 
ontologies. The display of the result sets, as well as 
database entries, is configurable; columns can be added 
to or removed from the result table to see more functional 
annotation than is available in the default display. 
Sequence similarity search results can be filtered by 
taxonomy to obtain a quick overview of the taxonomic 
distribution of the results, and the sequence annotations of 
the matched entries can be projected onto the sequence 
alignments to see at a glance whether important positions 
are conserved. The site has a simple and consistent URL 
scheme that allows the bookmarking of all searches to 
repeat them at a later time. All result sets can be down- 
loaded to offer users the possibility to retrieve customized 
data sets. However, large downloads are given low priority 
to ensure that they do not interfere with interactive 
queries, and they can, therefore, be slow compared with 
downloads from the UniProt FTP server. We, therefore, 
recommend downloading complete data sets from ftp.uniprot.org/pub/databases.
The website offers various
download formats (e.g. plain text, extensible mark-up 
language, RDF, FASTA, GFF), which depend on the 
chosen data set. The tab-delimited and Excel formats 
can be customized by selecting the desired columns in 
the graphical view of the result table. All data are also 
available in RDF (http://www.w3.org/RDF/), a W3C 
standard for publishing data on the Semantic Web. Both 
data and search results can also be accessed programmat- 
ically, either through simple HTTP (REST) requests 
(http://www.uniprot.org/faq/28) or our Java API 
(UniProtJAPI) (19). 
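
Because searches are addressable through a simple URL scheme, a field-based query can be turned into a bookmarkable or scriptable request by URL-encoding its parameters. The sketch below only constructs such a URL and does not contact the server. The base path, the parameter names (`query`, `format`, `columns`), the column names and the example query syntax are assumptions made for illustration; the documentation at http://www.uniprot.org/help/ should be consulted for the scheme the service actually accepts.

```python
from urllib.parse import urlencode


def build_search_url(query, fmt="tab", columns=("id", "genes"),
                     base="http://www.uniprot.org/uniprot/"):
    """Build a search URL from a query string, an output format and column names.

    The base URL and parameter names are assumptions for this sketch,
    not a statement of the service's current interface.
    """
    params = {
        "query": query,                # field-based or free-text query
        "format": fmt,                 # e.g. a tab-delimited download
        "columns": ",".join(columns),  # columns for customizable formats
    }
    return base + "?" + urlencode(params)


# Hypothetical field-based query combining two conditions.
url = build_search_url("organism:9606 AND reviewed:yes")
print(url)
```

Keeping URL construction separate from the network call makes the request easy to log, bookmark or hand to any HTTP client.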

Although the UniProt website provides a query inter- 
face for all UniProt data, some users also require facilities 
to search across related data in different databases. We 
have, therefore, set up a BioMart (20) (http://www.biomart.org)
instance at http://www.ebi.ac.uk/uniprot/biomart/martview
that allows complex queries between
UniProt and other data resources, such as PRIDE (21), 
Ensembl and InterPro. To offer users even more 



D46 Nucleic Acids Research, 2013, Vol. 41 , Database issue 



flexibility, we are going to provide a SPARQL Protocol 
and RDF Query Language (SPARQL) (http://www.w3.org/TR/rdf-sparql-query/)
end-point for all our data that
can be linked with any remote data resource that has a 
SPARQL end-point, using SPARQL 1.1's federated query 
capabilities. This new service is available for beta testing at 
http://beta.sparql.uniprot.org/. 
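
As a rough illustration of how such an end-point might be used, the sketch below assembles a SPARQL query containing a `SERVICE` clause (the SPARQL 1.1 federated query construct) and encodes it as a GET request URL, as the SPARQL protocol allows. It builds strings only and sends nothing. The end-point address is the beta URL quoted above; the `up:` prefix is assumed to be the UniProt core vocabulary namespace, and the remote end-point and the `ex:` vocabulary are purely hypothetical placeholders.

```python
from urllib.parse import urlencode

ENDPOINT = "http://beta.sparql.uniprot.org/"  # beta address quoted in the text
REMOTE = "http://example.org/sparql"          # hypothetical remote end-point

PREFIXES = (
    "PREFIX up: <http://purl.uniprot.org/core/>\n"  # assumed UniProt namespace
    "PREFIX ex: <http://example.org/vocab/>\n"      # hypothetical vocabulary
)


def federated_query(local_pattern, remote_endpoint, remote_pattern, limit=5):
    """Combine a local graph pattern with one evaluated at a remote end-point."""
    return (
        PREFIXES
        + "SELECT * WHERE {\n"
        + f"  {local_pattern}\n"
        + f"  SERVICE <{remote_endpoint}> {{ {remote_pattern} }}\n"
        + "}\n"
        + f"LIMIT {limit}\n"
    )


def sparql_get_url(endpoint, query):
    """Return a GET URL carrying the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


query = federated_query(
    "?protein a up:Protein .",          # pattern matched against UniProt data
    REMOTE,
    "?protein ex:measuredIn ?sample .", # pattern matched at the remote resource
)
request_url = sparql_get_url(ENDPOINT, query)
print(query)
```

The variable `?protein` is shared between the two patterns, which is what joins results from the local and remote resources.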

Your feedback is extremely valuable to help us improve 
our databases and services in terms of accuracy and 
usability. Please contact us if you have questions or sug- 
gestions through http://www.uniprot.org/contact or email 
us directly at help@uniprot.org. You can submit new data 
or updates at http://www.uniprot.org/help/submissions. 
Extensive documentation on how to best use our 
resource is available at http://www.uniprot.org/help/. 
UniProt is freely available for both commercial and 
non-commercial use. Please see http://www.uniprot.org/help/license
for details. New releases are published every
4 weeks except for UniMES, which is updated only when 
the underlying source data are updated. Release statistics 
are available at http://www.uniprot.org. 